Dear Admission Officer,
Appreciate for your time.
This report is a brief application of a time series analysis. I learned such mathematical theory and programming all by myself without external guide, and the document is just a personal note rather than a formal paper, so please be patient if any mistake happened. Besides, my original purpose is to apply the ARIMA+GARCH model in practice (“for fun”) rather than to build a high performance investment model, thus little financial strategy is applied here.
In fact, I was a full-time professional in Ernst & Young, one of the big4 firms, as a certified public accountant. Perhaps accounting and corporate finance is the area of my expertise. Nonetheless, I expect more from myself as I decide to expand my intellectual horizon to build an inspired mind. So here I am just willing to show my attempt and convey my passion, efforts, dedication and determination for your program.
I hope to see you in the near future.
Best regards,
MA Haoran
The purpose is to build a time series model and predict the Hang Seng Index (HSI). Firstly, get the HSI index and build the training data. Then, explore the data and capture the features for a model. Find the ARIMA model with the least AIC and fit the residuals by GARCH model. Finally, make a 10 days forecast, compare the outcome with testing data.
#loading package
library(tidyquant)
library(fpp3)
library(tseries)
library(rugarch)
library(data.table)
library(plotly)
setwd("C:/Users/Apple/Desktop/RStudio Tour/note/TSA")
The historical data is downloaded from Yahoo Finance([ctrl+click] to attach link) The training data covered from 2010-01-05 to 2021-01-29.
HSI <- fread("^HSI.csv")
HSI <- HSI[, .(Date, Close = as.numeric(Close))]
HSI <- HSI %>%
tsibble(index = Date) %>%
mutate(Return = difference(log(HSI$Close)) * 100)
HSI <- HSI %>%
na.omit() %>%
mutate(time = row_number()) %>%
update_tsibble(index = time)
head(HSI)
## # A tsibble: 6 x 4 [1]
## Date Close Return time
## <date> <dbl> <dbl> <int>
## 1 2010-01-05 22280. 2.07 1
## 2 2010-01-06 22417. 0.613 2
## 3 2010-01-07 22269. -0.659 3
## 4 2010-01-08 22297. 0.123 4
## 5 2010-01-11 22412. 0.513 5
## 6 2010-01-12 22327. -0.379 6
HSI %>%
ggplot(aes(x = Date, y = Close)) +
geom_line() +
theme_tq() +
labs(title = "HSI Index (date data)")
According to the plot, the Close Price is not stationary. In order to fit an ARIMA model, it could be helpful to difference the logarithm of the initial price.
Return = (ln(Close Price (t+1)) - ln(Close Price(t))) *100
HSI %>%
ggplot(aes(x = Date, y = Return)) +
geom_line() +
theme_tq() +
labs(title = "HSI Return (date data)")
The Return seems stationary without seasonality.
One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required.
H0:the data is stationary around a deterministic trend.
HSI%>%features(Return,unitroot_kpss)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0308 0.1
KPSS test Pvalue = 0.1 > 0.05, so the null hypothesis is accepted.
H0: a unit root is present.
H1: time series is stationary.
adf.test(HSI$Return, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: HSI$Return
## Dickey-Fuller = -13.999, Lag order = 13, p-value = 0.01
## alternative hypothesis: stationary
ADF test Pvalue = 0.01 < 0.05, so the null hypothesis is rejected.
In conclusion, the time series is stationary for an ARIMA model.
arima.mod <- HSI %>%
model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))
arima.mod
## # A mable: 1 x 1
## `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
## <model>
## 1 <ARIMA(2,0,3)>
arima.mod%>%
gg_tsresiduals()
H0: Data cannot be distinguished from white noise.
augment(arima.mod)%>%features(.innov,ljung_box)
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE) 0.0101 0.920
According to the plot and ljung-box test, pvalue>0.05, so the residuals of the model cannot be distinguished from white noise. It is rational to accept ARIMA(2,0,3).
There would be an ARCH effect if the acf and pacf of residuals^2 are significant.
p1 <- arima.mod %>%
augment() %>%
select(.innov) %>%
ACF(.innov ^ 2) %>%
autoplot() +
labs(y = "resid^2 acf")
p2 <- arima.mod %>%
augment() %>%
select(.innov) %>%
PACF(.innov ^ 2) %>%
autoplot() +
labs(y = "resid^2 pacf")
gridExtra::grid.arrange(p1, p2)
Thus the ARCH effect on residuals is significant, it plausible to fit a GARCH model on residuals.
Try model: ARIMA(2,0,3)+GARCH(1,1)
spec <-
ugarchspec(
variance.model = list(
model = "sGARCH",
garchOrder = c(1, 1),
submodel = NULL,
external.regressors = NULL,
variance.targeting = FALSE
),
mean.model = list(armaOrder = c(2, 3),
include.mean = TRUE),
distribution.model = "sged"
)
fit <- ugarchfit(spec, HSI$Return,
solver = "hybrid")
Report the final model
print(fit)
##
## *---------------------------------*
## * GARCH Model Fit *
## *---------------------------------*
##
## Conditional Variance Dynamics
## -----------------------------------
## GARCH Model : sGARCH(1,1)
## Mean Model : ARFIMA(2,0,3)
## Distribution : sged
##
## Optimal Parameters
## ------------------------------------
## Estimate Std. Error t value Pr(>|t|)
## mu 0.025299 0.018994 1.33197 0.182871
## ar1 -1.312107 0.015853 -82.76469 0.000000
## ar2 -0.440748 0.020674 -21.31875 0.000000
## ma1 1.311944 0.002183 601.05156 0.000000
## ma2 0.429959 0.029283 14.68293 0.000000
## ma3 0.006806 0.011885 0.57264 0.566887
## omega 0.020438 0.007478 2.73295 0.006277
## alpha1 0.051404 0.009379 5.48086 0.000000
## beta1 0.933010 0.012840 72.66518 0.000000
## skew 0.916142 0.021129 43.36049 0.000000
## shape 1.383882 0.053582 25.82747 0.000000
##
## Robust Standard Errors:
## Estimate Std. Error t value Pr(>|t|)
## mu 0.025299 0.019449 1.30079 0.193331
## ar1 -1.312107 0.012415 -105.69088 0.000000
## ar2 -0.440748 0.016284 -27.06573 0.000000
## ma1 1.311944 0.002901 452.27417 0.000000
## ma2 0.429959 0.016539 25.99719 0.000000
## ma3 0.006806 0.006901 0.98623 0.324022
## omega 0.020438 0.008056 2.53687 0.011185
## alpha1 0.051404 0.011494 4.47233 0.000008
## beta1 0.933010 0.014748 63.26441 0.000000
## skew 0.916142 0.022258 41.16010 0.000000
## shape 1.383882 0.055471 24.94806 0.000000
##
## LogLikelihood : -4046.441
##
## Information Criteria
## ------------------------------------
##
## Akaike 2.9933
## Bayes 3.0173
## Shibata 2.9933
## Hannan-Quinn 3.0020
##
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
## statistic p-value
## Lag[1] 3.696 0.05453
## Lag[2*(p+q)+(p+q)-1][14] 8.013 0.19343
## Lag[4*(p+q)+(p+q)-1][24] 11.670 0.59853
## d.o.f=5
## H0 : No serial correlation
##
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
## statistic p-value
## Lag[1] 2.893 0.08898
## Lag[2*(p+q)+(p+q)-1][5] 6.596 0.06492
## Lag[4*(p+q)+(p+q)-1][9] 8.511 0.10215
## d.o.f=2
##
## Weighted ARCH LM Tests
## ------------------------------------
## Statistic Shape Scale P-Value
## ARCH Lag[3] 1.928 0.500 2.000 0.1650
## ARCH Lag[5] 2.589 1.440 1.667 0.3551
## ARCH Lag[7] 3.664 2.315 1.543 0.3976
##
## Nyblom stability test
## ------------------------------------
## Joint Statistic: 1.5655
## Individual Statistics:
## mu 0.10650
## ar1 0.06487
## ar2 0.06350
## ma1 0.08800
## ma2 0.08856
## ma3 0.12493
## omega 0.09098
## alpha1 0.07925
## beta1 0.07443
## skew 0.18009
## shape 0.27474
##
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic: 2.49 2.75 3.27
## Individual Statistic: 0.35 0.47 0.75
##
## Sign Bias Test
## ------------------------------------
## t-value prob sig
## Sign Bias 0.6568 0.51138
## Negative Sign Bias 0.2529 0.80038
## Positive Sign Bias 2.0783 0.03778 **
## Joint Effect 10.3007 0.01618 **
##
##
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
## group statistic p-value(g-1)
## 1 20 20.06 0.3911
## 2 30 32.83 0.2845
## 3 40 38.35 0.4995
## 4 50 56.52 0.2146
##
##
## Elapsed time : 4.874232
Build a forecast function to predict return in 10 days.
fore <- function(input, p, q, n) {
spec <-
ugarchspec(
variance.model = list(
model = "sGARCH",
garchOrder = c(1, 1),
submodel = NULL,
external.regressors = NULL,
variance.targeting = FALSE
),
mean.model = list(armaOrder = c(p, q),
include.mean = TRUE),
distribution.model = "sged"
)
fit <- ugarchfit(spec, input$Return,
solver = "hybrid")
fore <- ugarchforecast(fit, n.ahead = n)
pred <- fore@forecast$seriesFor
pred <- as.data.table(pred)
setnames(pred, "Return")
pred[, time := seq.int(from = 1, to = n, by = 1)][]
}
pred <- fore(HSI, 2, 3, 10)
#View the predictions
pred
## Return time
## 1: 0.154925180 1
## 2: -0.109081450 2
## 3: 0.138608698 3
## 4: -0.064147019 4
## 5: 0.092721230 5
## 6: -0.023742514 6
## 7: 0.059931009 7
## 8: 0.001473559 8
## 9: 0.041297048 9
## 10: 0.014809375 10
Examine the predictions with testing set. The raw test data starts from 2021-01-29, the last day of the traing set.
test <- fread("test.csv")
test <- test[, .(Date, Close = as.numeric(Close))]
test <- test %>%
tsibble(index = Date) %>%
mutate(Return = difference(log(test$Close)) * 100)
test <- test %>%
na.omit() %>%
mutate(time = row_number()) %>%
update_tsibble(index = time)
# Preview the test data
head(test, 5)
## # A tsibble: 5 x 4 [1]
## Date Close Return time
## <date> <dbl> <dbl> <int>
## 1 2021-02-01 28893. 2.13 1
## 2 2021-02-02 29249. 1.22 2
## 3 2021-02-03 29307. 0.201 3
## 4 2021-02-04 29114. -0.664 4
## 5 2021-02-05 29289. 0.600 5
Plot forecast value and test value.
foretest <- function(pred, test) {
p <- pred %>%
tsibble(index = time) %>%
left_join(test, by = "time") %>%
mutate(Pred = Return.x,
Actual = Return.y) %>%
select(time, Date, Close, Pred, Actual) %>%
pivot_longer(
cols = c(Pred, Actual),
values_to = "Return",
names_to = "Type"
) %>%
ggplot(aes(x = Date, y = Return, color = Type)) +
geom_line() +
scale_x_date(date_minor_breaks = "1 day") +
theme_tq()
ggplotly(p)
}
foretest(pred, test)
Discrepancy could easily be observed, the actual line seems 1 lag behind the prediction line, but the locations of inflection points are pretty close. (Perhaps this is caused by effective market… anyway, effective market maybe a good news for society. Just kidding!)
Of course the model has to be improved. Cross-validation would be useful, and it might be fun to use bootstrap and function hilo() to draw the prediction interval.
To assure the reproducibility, all code and data have been uploaded to Github([ctrl+click] to attach link)